Abstract

Matrix clocks are a generalization of the notion of vector clocks that allows the local representation
of causal precedence to reach into an asynchronous distributed computation's past with depth x,
where $x \ge 1$ is an integer. Maintaining matrix clocks correctly in a system of n nodes requires
that every message be accompanied by $O(n^x)$ numbers, which reflects an exponential dependency
of the complexity of matrix clocks upon the desired depth x. We introduce a novel type of matrix
clock, one that requires only $nx$ numbers to be attached to each message while maintaining what
for many applications may be the most significant portion of the information that the original 
matrix clock carries. In order to illustrate the new clock's applicability, we demonstrate its use in 
the monitoring of certain resource-sharing computations. 

1. Introduction and background

We consider an undirected graph G on n nodes. Each node in G stands for a computational process 
and undirected edges in G represent the possibilities for bidirectional point-to-point communication 
between pairs of processes. A fully asynchronous distributed computation carried out by the 
distributed system represented by G can be viewed as a set of events occurring at the various 
nodes. An event is the sending of a message by a node to another node to which it is directly 
connected in G (a neighbor in G), or the reception of a message from such a neighbor, or yet the 
occurrence at a node of any internal state transition of relevance ("state" and "relevance" here are
highly dependent upon the particular computation at hand, and are left unspecified). 

The standard framework for analyzing such a system is the partial order, often denoted by $\prec$,
that formalizes the usual "happened-before" notion of distributed computing [2, 15]. This partial
order is the transitive closure of the more elementary relation to which the ordered pair (v, v')
of events belongs if either v and v' are consecutive events at the same node in G or v and v' are,
respectively, the sending and receiving of a message between neighbors in G.

At node i, we let local time be assessed by the number $t_i$ of events that already occurred. We
interchangeably adopt the notations $t_i = \mathit{time}_i(v)$ and $v = \mathit{event}_i(t_i)$ to indicate that v is the $t_i$th
event occurring at node i, provided $t_i \ge 1$. Another important partial order on the set of events
is the relation that gives the predecessor at node $j \ne i$ of an event v' occurring at node i. We say
that an event v is such a predecessor, denoted by $v = \mathit{pred}_j(v')$, if v is the event occurring at node
j such that $v \prec v'$ for which $\mathit{time}_j(v)$ is greatest. If no such v exists, then $\mathit{pred}_j(v')$ is undefined
and $\mathit{time}_j(\mathit{pred}_j(v'))$ is assumed to be zero.
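
As an illustration of these definitions, the following Python sketch computes the happened-before relation of a small computation by brute force and derives the predecessor relation from it. The encoding of events as (node, local time) pairs, the function names, and the three-node example are assumptions made here for illustration only; they are not part of the formalism above.

```python
from itertools import product


def happened_before(events_per_node, messages):
    """Return the set of pairs (v, w) such that v happened before w.

    events_per_node: dict mapping a node to its number of events.
    messages: iterable of ((sender, t_send), (receiver, t_recv)) pairs.
    An event is the pair (node, local_time), with local_time >= 1.
    """
    events = [(i, t) for i, count in events_per_node.items()
              for t in range(1, count + 1)]
    # Elementary relation: consecutive events at a node, and send -> receive.
    closure = {((i, t), (i, t + 1)) for (i, t) in events
               if t + 1 <= events_per_node[i]}
    closure |= set(messages)
    # Transitive closure, Warshall style; adequate for small examples only.
    for mid in events:
        for a, b in product(events, events):
            if (a, mid) in closure and (mid, b) in closure:
                closure.add((a, b))
    return closure


def pred(j, v_prime, closure):
    """Latest event at node j that happened before v_prime, or None."""
    times = [t for (node, t), w in closure if node == j and w == v_prime]
    return (j, max(times)) if times else None


# Hypothetical computation: node 1 sends to node 2, then node 2 sends to node 3.
hb = happened_before({1: 2, 2: 3, 3: 2},
                     [((1, 1), (2, 2)), ((2, 3), (3, 2))])
print(pred(1, (3, 2), hb))  # predecessor at node 1 of the second event of node 3
```

In this sketch an undefined predecessor is represented by `None`, which corresponds to the convention that the associated local time is zero.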

The relation $\mathit{pred}_j$ allows the definition of vector clocks [11-14, 17], as follows. The vector
clock of node i at time $t_i$ (that is, following the occurrence of $t_i$ events at node i), denoted by
$V^i(t_i)$, is a vector whose jth component, for $1 \le j \le n$, is either equal to 0, if $t_i = 0$, or given by

$$
V^i_j(t_i) =
\begin{cases}
t_i, & \text{if } j = i; \\
\mathit{time}_j(\mathit{pred}_j(\mathit{event}_i(t_i))), & \text{if } j \ne i,
\end{cases}
\qquad (1)
$$

if $t_i \ge 1$. In other words, $V^i_j(t_i)$ is either the current time at node i, if $j = i$, or is the time at
node j that results from the occurrence at that node of the predecessor of the $t_i$th event of node
i, otherwise. If no such event exists (i.e., $t_i = 0$), then $V^i_j(t_i) = 0$.

Vector clocks evolve following two simple rules:

• Upon sending a message to one of its neighbors, node i attaches $V^i(t_i)$ to the message,
where $t_i$ is assumed to already account for the message that is being sent.

• Upon receiving a message from node k with attached vector clock $V^k$, node i sets $V^i_i(t_i)$
to $t_i$ (which is assumed to already reflect the reception of the message) and $V^i_j(t_i)$ to
$\max\{V^i_j(t_i - 1), V^k_j\}$, for $j \ne i$.
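
The two rules for vector clocks can be sketched in Python as follows. The class name, the 0-based node indices, and the separate method for internal events are assumptions of this sketch rather than part of the rules themselves.

```python
class VectorClockNode:
    """Vector clock of one node; clock[j] plays the role of V^i_j."""

    def __init__(self, i, n):
        self.i = i              # index of this node, 0-based in this sketch
        self.clock = [0] * n    # all components are zero before any event

    def internal_event(self):
        """Count an internal state transition as one more local event."""
        self.clock[self.i] += 1

    def send(self):
        """Count the send event, then return a copy of the clock to attach."""
        self.clock[self.i] += 1
        return list(self.clock)

    def receive(self, attached):
        """Count the receive event, then merge the attached clock."""
        self.clock[self.i] += 1
        for j, value in enumerate(attached):
            if j != self.i:
                self.clock[j] = max(self.clock[j], value)


# Hypothetical run on three nodes: node 0 sends to node 1, which sends to node 2.
a, b, c = (VectorClockNode(i, 3) for i in range(3))
b.receive(a.send())
c.receive(b.send())
print(c.clock)  # [1, 2, 1]
```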



It is a simple matter to verify that these rules do indeed maintain vector clocks consistently with 
their definition [11-14, 17]. Under these rules or variations thereof, vector clocks have proven useful 
in a variety of distributed algorithms to detect some types of global predicates [10, 14]. 

For large n, attaching a vector clock to every message that is sent is likely to become
burdensome, so the question arises whether less costly implementations are possible. Under the very
general assumptions we have made concerning the nature of G as a distributed system, the answer 
is negative: a result similar to Dilworth's theorem on partially ordered sets [9] establishes that the 
size-n attachments are necessary [7]. However, it is possible to use more economical attachments 
if the edges of G provide FIFO communication [22], or if certain aspects of the structure of G can 
be taken into account [19], or yet if the full capabilities of vector clocks are not needed [1]. 

One natural generalization of the notion of vector clocks is the notion of matrix clocks [21,
24]. For an integer $x \ge 1$, the x-dimensional matrix clock of node i at time $t_i$, denoted by $M^i(t_i)$,
has $O(n^x)$ components. For $1 \le j_1, \ldots, j_x \le n$, component $M^i_{j_1,\ldots,j_x}(t_i)$ of $M^i(t_i)$ is only defined
for $i = j_1 = \cdots = j_x$ and for $i \ne j_1 \ne \cdots \ne j_x$. As in the definition of $V^i(t_i)$, $M^i_{j_1,\ldots,j_x}(t_i) = 0$ if
$t_i = 0$. For $t_i \ge 1$, on the other hand, we have

$$
M^i_{j_1,\ldots,j_x}(t_i) =
\begin{cases}
t_i, & \text{if } i = j_1 = \cdots = j_x; \\
\mathit{time}_{j_x}(\mathit{pred}_{j_x} \cdots \mathit{pred}_{j_1}(\mathit{event}_i(t_i))), & \text{if } i \ne j_1 \ne \cdots \ne j_x,
\end{cases}
\qquad (2)
$$

which, for $i \ne j_1 \ne \cdots \ne j_x$, first takes the predecessor at node $j_1$ of the $t_i$th event occurring at
node i, then the predecessor at node $j_2$ of that predecessor, and so on through node $j_x$, whose local
time after the occurrence of the last predecessor in the chain is assigned to $M^i_{j_1,\ldots,j_x}(t_i)$. Should any
of these predecessors be undefined, the ones that follow it in the remaining nodes are undefined as
well, and the local time that results at the end is zero. It is straightforward to see that, for $x = 1$,
this definition is equivalent to the definition of a vector clock in (1). Similarly, the maintenance of
matrix clocks follows rules entirely analogous to those used to maintain vector clocks [14].

While the jth component of the vector clock following the occurrence of event v' at node $i \ne j$
gives the time resulting at node j from the occurrence of $\mathit{pred}_j(v')$, the analogous interpretation
that exists for matrix clocks requires the introduction of additional notation. Specifically, the
definition of a set of events encompassing the possible x-fold compositions of the relation $\mathit{pred}_j$
with itself, denoted by $\mathit{Pred}_j^{(x)}$, is necessary. If an event v occurs at node j, then we say that
$v \in \mathit{Pred}_j^{(x)}(v')$ for an event v' occurring at node i if one of the following holds:

• $x = 1$, $j \ne i$, and $v = \mathit{pred}_j(v')$.

• $x > 1$ and there exists $k \ne i$ such that an event $\bar{v}$ occurs at node k for which $\bar{v} = \mathit{pred}_k(v')$
and $v \in \mathit{Pred}_j^{(x-1)}(\bar{v})$.

Note that this definition requires $j \ne k$, in addition to $k \ne i$, for $v \in \mathit{Pred}_j^{(x)}(v')$ to hold when
$x = 2$.
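
The recursive definition of these sets translates almost directly into code. The Python sketch below assumes that events are (node, local time) pairs and that a function `pred(k, v)`, returning the predecessor at node k of event v or `None` when it is undefined, is supplied by the caller; both the function name `pred_set` and the hand-written predecessor table are hypothetical.

```python
def pred_set(j, x, v_prime, pred, nodes):
    """Return the events v at node j that belong to Pred_j^(x)(v_prime).

    pred(k, v) gives the predecessor at node k of event v, or None.
    nodes is the collection of all node identifiers.
    """
    i = v_prime[0]                      # node at which v_prime occurs
    if x == 1:
        # Base case: requires j != i and a defined predecessor at node j.
        v = pred(j, v_prime) if j != i else None
        return {v} if v is not None else set()
    result = set()
    for k in nodes:
        if k == i:
            continue                    # the intermediate node must differ from i
        v_bar = pred(k, v_prime)
        if v_bar is not None:
            result |= pred_set(j, x - 1, v_bar, pred, nodes)
    return result


# Hypothetical predecessor table for a three-node computation in which
# node 1 sends to node 2 and node 2 later sends to node 3.
table = {
    (1, (3, 2)): (1, 1),
    (2, (3, 2)): (2, 3),
    (1, (2, 3)): (1, 1),
}
lookup = lambda k, v: table.get((k, v))

print(pred_set(1, 2, (3, 2), lookup, [1, 2, 3]))  # {(1, 1)}
```

Note how the base case rejects $j = i$: this is what enforces $j \ne k$ at depth 2, since in the recursive call the intermediate node k takes the role of i.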

For $x \ge 1$ and $i \ne j_1 \ne \cdots \ne j_x = j$, this definition allows the following interpretation of
entry $j_1, \ldots, j_x$ of the matrix clock that follows the occurrence of event v' at node i, that is, of
$M^i_{j_1,\ldots,j_x}(\mathit{time}_i(v'))$. It gives the time resulting at node j from the occurrence of an event that
is in the set $\mathit{Pred}_j^{(x)}(v')$, so long as it is nonempty. Of course, the number of possibilities for such
an event is $O(n^{x-1})$ in the worst case, owing to the several possible combinations of $j_1, \ldots, j_{x-1}$.

Interestingly, applications that require the full capabilities of matrix clocks have yet to be 
identified. In fact, it seems a simple matter to argue that a slightly more sophisticated use of 
vector clocks or simply the use of two-dimensional matrix clocks (the x = 2 case) suffices to 
tackle some of the problems that have been offered as possible applications of higher-dimensional 
matrix clocks [14, 18]. What we do in this paper is to demonstrate how the distributed monitoring 
of certain resource-sharing computations can benefit from the use of matrix clocks and that, at 
least for such computations, it is possible to employ much less complex matrix clocks (that is, 
matrix clocks with many fewer components), which nonetheless retain the ability to reach into the
computation's past with arbitrary depth. 

The key to this reduction in complexity is the use of one single event in place of each set
$\mathit{Pred}_j^{(y)}(v')$, for $1 \le y \le x$. In other words, for each node j the matrix clock we introduce retains
only one of the $O(n^{y-1})$ components of each of the x original y-dimensional matrix clocks: one
component from the vector clock, one from the two-dimensional matrix clock, and so on through
one component of the x-dimensional matrix clock. As we argue in Section 3, following a brief
discussion of the resource-sharing context in Section 2, this simplification leads to matrix clocks of
size $nx$, therefore considerably less complex than the $O(n^x)$-sized original matrix clocks.
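
To give a feel for the difference in attachment sizes, the short Python sketch below tabulates $nx$ against $n^x$. Taking $n^x$ as a stand-in for the $O(n^x)$ bound is an assumption of this sketch; the exact number of defined components of the original matrix clock is not computed, and the function name is hypothetical.

```python
def attachment_sizes(n, x):
    """Return (n * x, n ** x): the new clock size and the n^x reference value."""
    return n * x, n ** x


# Ten nodes, depths 1 through 4.
for depth in range(1, 5):
    new_size, reference = attachment_sizes(10, depth)
    print(depth, new_size, reference)
```

For ten nodes and depth 3, for instance, the new clock carries 30 integers, against a reference value of 1000.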

The single other attempt at reducing the complexity of a matrix clock that we know of was
given for the $x = 2$ case specifically and culminated with the introduction of two techniques [20].
The first one requires attachments of size $O(n)$ (a considerable reduction from the original $O(n^2)$),
but is only applicable if the full asynchronism we have been assuming does not hold; it is therefore
of little interest in our context. The other technique is somewhat closer to our own in spirit,
since it aims at approximating the two-dimensional matrix clock by retaining only k of the $O(n)$
components that correspond to each node. However, the criterion to select the components to be
retained is to choose the k greatest components, which seems unrelated to our own criterion, based
as it is on the $\mathit{Pred}_j^{(y)}$ sets.

We give concluding remarks in Section 4, after a discussion of how the technique of Section 3 
can successfully solve the problem posed in Section 2. 

2. Monitoring resource-sharing computations 

The resource-sharing computation we consider is one of the classical solutions to the paradigmatic 
Dining Philosophers Problem (DPP) [8] in generalized form [6]. In this case, G's edge set is 
constructed from a given set of resources and from subsets of that set, one for each node, indicating 
which resources can ever be needed by that node. This construction places an edge between nodes
i and j if the sets of resources ever to be needed by i and j have a nonempty intersection. Notice
that this construction is consonant with the interpretation of edges as bidirectional communication
channels, because it deploys edges between every pair of nodes that may ever compete for the same
resource and must therefore be able to communicate with each other to resolve conflicts.

In DPP, the computation carried out by a node makes it cycle endlessly through three states, 
which are identified with the conditions of being idle, being in the process of acquiring exclusive 
access to the resources it needs, and using those resources for a finite period of time. While in the 
idle state, the node starts acquiring exclusive access to resources when the need arises to compute 
on shared resources. It is a requirement of DPP that the node must acquire exclusive access to all 
the resources it shares with all its neighbors, so it suffices for the node to acquire a token object 
it shares with each of its neighbors (the "fork," as it is called), each object representing all the 
resources it shares with that particular neighbor. When in possession of all forks, the node may 
start using the shared resources [2, 6]. 

The process of collecting forks from neighbors follows a protocol based on the sending of 
request messages by the node that needs the forks and the sending of the forks themselves by the 
nodes that have them. More than one protocol exists, each implementing a different rule to ensure 
the absence of deadlocks and lockouts during the computation. The solution we consider in this 
section is based on relative priorities assigned to nodes. Another prominent solution is also based 
on the assignment of relative priorities, but to the resources instead of to the nodes [16, 23]. 

The priority scheme of interest to us is based on the graph-theoretic concept of an acyclic 
orientation of G, which is an assignment of directions to the edges of G in such a way that directed 
cycles are not formed. Such an acyclic orientation is then a partial order of the nodes of G, 
and is as such suitable to indicate, given a pair of neighbors, which of the two has priority over 
the other. Most of the details of this priority scheme are not relevant to our present discussion, 
but before stating its essentials we do mention that the lockout-freedom requirement leads the
acyclic orientation to be changed dynamically (so that relative priorities are never fixed), which
in turn leads to a rich dynamics in the set of all the acyclic orientations of G and to important
concurrency-related issues [3-5].

Once a priority scheme is available over the set of nodes, what matters to us is how it is used 
in the fork-collecting protocol. When a request for a fork arrives at node j from node i, j sends i the
fork they share if j is either idle or is also collecting forks but does not have priority over i. If j is
also collecting forks and has priority over i, or if j is using shared resources, then the sending of
the fork to i is postponed until j has finished using the shared resources. Note that two types
of wait may happen here. If j is using shared resources when the request arrives, then the wait is
independent of n. If j is also collecting forks, then the wait for j to start using the shared resources
and ultimately to send i the fork is in the worst case $n - 1$ [2, 4, 6]. The reason for this is simple: j
is waiting for a fork from another node, which may in turn be waiting for a fork from yet another
node, and so on. Because the priority scheme is built on acyclic orientations of G, such a chain of
waits does necessarily end and is $n - 1$ nodes long in the worst case.
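
The decision taken by node j when a request arrives can be summarized by the small Python sketch below. The `State` enumeration and the function name are hypothetical, and how relative priorities are stored or updated is deliberately left out.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    COLLECTING = "collecting forks"
    USING = "using shared resources"


def sends_fork_immediately(state_j, j_has_priority_over_i):
    """Tell whether node j answers a request from node i right away.

    Returns False when the sending of the fork is postponed until j has
    finished using the shared resources.
    """
    if state_j is State.IDLE:
        return True
    if state_j is State.COLLECTING:
        return not j_has_priority_over_i
    return False  # State.USING: always postponed


print(sends_fork_immediately(State.COLLECTING, True))  # False
```

The postponed case with `State.COLLECTING` is the one that can give rise to the chains of waits discussed above.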

Whether such long waits occur or not is of course dependent upon the details of each execution 
of the resource-sharing computation. But if they do occur, one possibility for reducing the average 
wait is to increase the availability of certain critical resources so that G becomes less dense [2, 4, 
5]. Perhaps another possibility would be to fine-tune some of the characteristics of each node's 
participation in the overall computation, such as the duration of its idle period, which could be 
subject to a mandatory lower bound, for example. In any event, the ability to locally detect long 
waits (a global property, since it relates to the notion of time in fully asynchronous distributed 
systems [2]) and identify the nodes at which the wait chains end is crucial. 

To see how this relates to the formalism of Section 1, suppose our resource-sharing computation 
consists of the exchange of fork-bearing messages only (that is, request messages and all other 
messages involved in the computation, such as those used to update acyclic orientations, are 
ignored). For distinct nodes i and j, and for an event v' occurring at node i at the reception of
a fork, the set $\mathit{Pred}_j^{(x)}(v')$ for $x \ge 1$ is either empty or only contains events that correspond to
the sending of forks by node j. Now suppose we systematically investigate the sets $\mathit{Pred}_j^{(x)}(v')$ for
every appropriate j by letting x increase from 1 through $n - 1$. Suppose also that, for each j, we
record the first x that is found such that $\mathit{Pred}_j^{(x)}(v')$ contains the sending of a fork as response
to the reception of a request message without any wait for fork collection on the part of j. The
greatest such value to be recorded, say $x^*$, has special significance: it means that the eventual
reception of a fork by node i through v' is the result of an $x^*$-long chain of waits.

If this were the only information of interest, then Lamport's clocks [15] could be used trivially
to discover it. However, taking corrective measures may require a wider array of long wait chains
to be found (not simply the longest), as well as the nodes at which those chains end. The matrix
clock that we introduce in Section 3 is capable of conveying this information to node i succinctly, so
long as there exists enough redundancy in the sets $\mathit{Pred}_j^{(x)}(v')$ that attaching only $nX$ integers to
forks suffices, where X is a threshold in the interval $[1, n - 1]$ beyond which wait chains are known
not to occur, given the structure of G and the initial arrangement of priorities (see footnote 1). It so happens that
such redundancy clearly exists: the only events in $\mathit{Pred}_j^{(x)}(v')$ that matter are those corresponding
to the sending of forks without any wait for fork collection. Detecting any one of them suffices, so
we may as well settle for the latest, that is, one single event from the whole set. We return to this
in Section 4.

3. A simpler matrix clock 

For $x \ge 1$, the new matrix clock we introduce is an $x \times n$ matrix. For node i at time $t_i$ (i.e.,
following the occurrence of $t_i$ events at node i), it is denoted by $C^i(t_i)$. For $1 \le y \le x$ and
$1 \le j \le n$, component $C^i_{y,j}(t_i)$ of $C^i(t_i)$ is defined as follows. If $t_i = 0$, then $C^i_{y,j}(t_i) = 0$, as for
vector clocks and the matrix clocks of Section 1. If $t_i \ge 1$, then we have

$$
C^i_{y,j}(t_i) =
\begin{cases}
t_i, & \text{if } y = 1 \text{ and } j = i; \\
\max_{i \ne j_1 \ne \cdots \ne j_y = j} \mathit{time}_{j_y}(\mathit{pred}_{j_y} \cdots \mathit{pred}_{j_1}(\mathit{event}_i(t_i))), & \text{if } y > 1 \text{ or } j \ne i.
\end{cases}
\qquad (3)
$$

Note, first of all, that for $y = 1$ this definition is equivalent to the definition of a vector clock in
(1). Thus, the first row of $C^i(t_i)$ is the vector clock $V^i(t_i)$. For $y > 1$, the definition in (3) implies,
according to the interpretation of matrix clocks that follows our definition in (2), that

$$
C^i_{y,j}(t_i) =
\begin{cases}
\max\{\mathit{time}_j(v) \mid v \in \mathit{Pred}_j^{(y)}(v')\}, & \text{if } \mathit{Pred}_j^{(y)}(v') \ne \emptyset; \\
0, & \text{if } \mathit{Pred}_j^{(y)}(v') = \emptyset,
\end{cases}
\qquad (4)
$$

where v' is the $t_i$th event occurring at node i. What this means is that, of all the $O(n^{y-1})$ events
that may exist in $\mathit{Pred}_j^{(y)}(v')$, only one (the one to have occurred latest at node j) makes it to the
matrix clock $C^i(t_i)$. Note also that (3) implies the equality in (4) for $y = 1$ as well, so long as
$j \ne i$. In this case, $\mathit{Pred}_j^{(1)}(v')$, if nonempty, is the singleton $\{\mathit{pred}_j(v')\}$.



Footnote 1. Discovering the value of X is not necessarily a simple task, but some empirical knowledge has
already been accumulated for modestly-sized systems [3]; in any case, in the likely event that X
cannot be determined with certainty, there is always the possibility of adaptation as the
resource-sharing computation is observed on the system at hand.







Figure 1. A computation fragment on six nodes. 

Before we derive the update rules for our new matrix clock, let us pause and examine an
example. Consider Figure 1, where a computation on six nodes is illustrated by means of the usual
message diagram that forms the basis of the relation $\prec$. Nodes are numbered 1 through 6, and in
the figure local time elapses from left to right independently for each node. Filled circles represent
events and the arrows connecting two events indicate messages.

Three events are singled out in Figure 1, namely $v'$, $v_1$, and $v_2$. They are related to one
another in such a way that $\mathit{Pred}_6^{(4)}(v') = \{v_1, v_2\}$, that is, $v_1$ and $v_2$ are the "depth-4" predecessors
of $v'$ at node 6. Recalling that $\mathit{time}_1(v') = 3$, there are two components of the four-dimensional
matrix clock $M^1(3)$ that reflect this relationship among the three events, namely

$$M^1_{3,4,5,6}(3) = \mathit{time}_6(v_1) = 2$$

and

$$M^1_{2,3,4,6}(3) = \mathit{time}_6(v_2) = 3.$$

These follow directly from (2) with $x = 4$. By (3) with $y = 4$, the same diagram of Figure 1
yields

$$
\begin{aligned}
C^1_{4,6}(3) &= \max\{\mathit{time}_6(\mathit{pred}_6\,\mathit{pred}_5\,\mathit{pred}_4\,\mathit{pred}_3(v')),\ \mathit{time}_6(\mathit{pred}_6\,\mathit{pred}_4\,\mathit{pred}_3\,\mathit{pred}_2(v'))\} \\
&= \max\{\mathit{time}_6(v_1), \mathit{time}_6(v_2)\} \\
&= 3,
\end{aligned}
$$

which is clearly also in consonance with (4).

Let us now look at the update rules. For $y = 1$, these rules are the same as those given for
vector clocks in Section 1. For $y > 1$, we know from the definition of $\mathit{Pred}_j^{(y)}$ that $v \in \mathit{Pred}_j^{(y)}(v')$
if and only if there exists $k \ne i$ such that an event $\bar{v}$ occurs at node k for which $\bar{v} = \mathit{pred}_k(v')$ and
$v \in \mathit{Pred}_j^{(y-1)}(\bar{v})$. It then follows from (4) that

$$
C^i_{y,j}(t_i) = \max_{k \in K^i(t_i,y,j)} C^k_{y-1,j}(\mathit{time}_k(\mathit{pred}_k(\mathit{event}_i(t_i)))),
\qquad (5)
$$

where $K^i(t_i, y, j)$ is the set comprising every appropriate k. Notice, in (5), that $k = j$ can never
occur as the maximum is taken if $y = 2$: aside from i, node j is the only node that cannot possibly
be a member of $K^i(t_i, 2, j)$, by (3).

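According to (5), and recalling once again the special case of $y = 1$, we are then left with the
following two rules for the evolution of our new matrix clocks:

• Upon sending a message to one of its neighbors, node i attaches $C^i(t_i)$ to the message,
where $t_i$ is assumed to already account for the message that is being sent.

• Upon receiving a message from node k with attached matrix clock $C^k$, and assuming that
$t_i$ already reflects the reception of the message, node i sets $C^i_{y,j}(t_i)$ to

$$
\begin{cases}
t_i, & \text{if } y = 1 \text{ and } j = i; \\
\max\{C^i_{1,j}(t_i - 1), C^k_{1,j}\}, & \text{if } y = 1 \text{ and } j \ne i; \\
C^i_{2,j}(t_i - 1), & \text{if } y = 2 \text{ and } j = k; \\
\max\{C^i_{y,j}(t_i - 1), C^k_{y-1,j}\}, & \text{if } 1 < y \le x, \text{ provided } y > 2 \text{ or } j \ne k.
\end{cases}
$$

According to these rules, every message carries an attachment of $nx$ integers.
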
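
The sending and reception rules for the new matrix clock can be sketched in Python as follows. The class name, the 0-based node indices, and the storage of row y at list position y − 1 are assumptions of this sketch, which only applies the rules as stated: on reception, row 1 is merged like a vector clock, and each row $y > 1$ is merged with row $y - 1$ of the attached clock, except for the column of the sender when $y = 2$.

```python
class MatrixClockNode:
    """x-by-n clock of one node; clock[y - 1][j] plays the role of C^i_{y,j}."""

    def __init__(self, i, n, x):
        self.i, self.n, self.x = i, n, x
        self.clock = [[0] * n for _ in range(x)]

    def send(self):
        """Count the send event and return a copy of the clock to attach."""
        self.clock[0][self.i] += 1
        return [row[:] for row in self.clock]

    def receive(self, k, attached):
        """Apply the reception rule for a message from node k."""
        self.clock[0][self.i] += 1                 # y = 1 and j = i
        for j in range(self.n):
            if j != self.i:                        # y = 1 and j != i
                self.clock[0][j] = max(self.clock[0][j], attached[0][j])
        for y in range(2, self.x + 1):
            for j in range(self.n):
                if y == 2 and j == k:
                    continue                       # y = 2 and j = k: unchanged
                self.clock[y - 1][j] = max(self.clock[y - 1][j],
                                           attached[y - 2][j])


# Hypothetical run with n = 3 and x = 2: node 2 sends to node 1,
# then node 1 sends to node 0.
a, b, c = (MatrixClockNode(i, 3, 2) for i in range(3))
b.receive(2, c.send())
a.receive(1, b.send())
print(a.clock)  # [[1, 2, 1], [0, 0, 1]]
```

In this run the entry in row 2, column 2 of node 0 records the local time of node 2 at the event that reaches node 0 through node 1, while the attachment stays at $nx = 6$ integers.
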
4. Discussion and concluding remarks 

We are now in a position to return to the problem posed in Section 2, namely the problem of 
monitoring executions of the solution to DPP that employs a partial order on G's set of nodes to 
establish priorities. As we discussed in that section, the overall goal is to allow nodes to detect 
locally, upon receiving a fork, whether the delivery of that fork is the result of a chain of fork 
deliveries that started too far back in the past. In the affirmative case, the wait since the fork was 
requested will have been too long, in terms of the usual notions of time complexity in asynchronous 
distributed computations. 

More specifically, if v' is the event corresponding to the reception of a fork at node i, then the
goal is for i to be able to detect the existence of an event in $\mathit{Pred}_j^{(y)}(v')$ that corresponds to the
sending of a fork by node j either immediately upon the reception of the request for that fork or,
if node j was using shared resources when the request arrived, immediately upon finishing. Here
$1 \le y \le X$ and j is any node, provided $y = 1$ and $j = i$ do not occur in conjunction. The value of
X is such that $1 \le X \le n - 1$, and is chosen as a bound to reflect the maximum possible chain of
waits. As a consequence, the sets $\mathit{Pred}_j^{(y)}(v')$ must include fork-related events only.

The new matrix clocks introduced in Section 3 can be used for this detection with only minimal 
adaptation. The key ingredients are: 

• The only messages sent by node i to be tagged with the matrix clock $C^i$ are forks. If the
sending of a fork by node i does not depend on further fork collection by i, then every
component of $C^i$ other than $C^i_{1,i}$ is reset to zero before it is attached to the fork. Matrix
clocks are $X \times n$, and all further handling of them follows the rules given in Section 3.

• Upon receiving a fork with attached matrix clock, and having updated $C^i$ accordingly, node
i looks for components of $C^i$ that contain nonzero values. If $C^i_{y,j}$ is one such component for
$y > 1$ or $j \ne i$, then a wait chain of length y that ends at node j has been discovered and
can be checked against a certain threshold $X' \le X$ representing the maximum allowable
chain length. If $C^i$ contains zero components at all positions but $(1, i)$, then it is certain
that the value of X has to be revised, as clearly a wait chain exists whose length surpasses
X. A greater value for X is then needed.
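
These two ingredients can be sketched in Python as follows. The function names, the 0-based node indices, and the choice of zeroing a copy of the clock rather than the local clock itself are assumptions of this sketch; the text above does not say whether the local copy is reset as well.

```python
def clock_for_fork(clock, i, waited_for_forks):
    """Return the clock that node i attaches to a fork it is sending.

    clock[y - 1][j] holds the component in row y, column j. If the fork is
    sent without further fork collection by i, all components except the
    one at position (1, i) are zeroed in the attached copy.
    """
    attached = [row[:] for row in clock]
    if not waited_for_forks:
        for y, row in enumerate(attached):
            for j in range(len(row)):
                if not (y == 0 and j == i):
                    row[j] = 0
    return attached


def inspect_clock(clock, i, max_allowed):
    """Scan the updated clock of node i after a fork has been received.

    Returns (chains, too_long, revise_bound): chains lists (y, j) for every
    nonzero component other than position (1, i); too_long keeps those with
    y greater than max_allowed (the threshold X'); revise_bound is True when
    no chain was found, meaning that a greater X is needed.
    """
    chains = [(y, j)
              for y, row in enumerate(clock, start=1)
              for j, value in enumerate(row)
              if value != 0 and (y > 1 or j != i)]
    too_long = [(y, j) for (y, j) in chains if y > max_allowed]
    return chains, too_long, not chains


# Hypothetical 2-by-3 clock at node 0 with threshold X' = 1.
print(inspect_clock([[3, 2, 0], [0, 0, 1]], 0, 1))
```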

This strategy reflects the general method of Section 3 when applied to events that relate to the
flow of forks only. Whenever the request for a fork is received and the fork can be sent without the
need for any forks to be received by the node in question, say node j, zeroes get inserted into the
matrix clock at all positions but $(1, j)$ and are sent along with the fork. The sending of this fork
may unleash the sending of forks by other nodes in a chain of events, and along the way the original
value of $C^j_{1,j}$ may never be reset to zero, reflecting the increasing length of the wait chain rooted at
node j. The reception of a fork whose matrix clock has such a nonzero component beyond row X'
is then an indication that such reception is part of chains whose lengths are considered too long.

It is now worth returning to Figure 1 in order to view its diagram as representing the
fork-bearing messages in a fragment of a DPP computation of the type we have been considering. To
that end, consider Figure 2, where two six-node graphs are depicted. Figure 2(a) shows the graph G
corresponding to a certain resource-sharing pattern for the six nodes; it shows an acyclic orientation 
of the edges of G as well, indicating the initial arrangement of priorities. In a situation of heavy 
demand for resources by all six nodes, only node 6 can acquire all the (three) forks it needs and 
proceed. All others must wait along the length-5 wait chain shown in Figure 2(b): node 1 for node 
2, 2 for 3, and so on through node 6. 

Figure 2. G oriented acyclically (a) and a possible wait chain (b).



Eventually, it must happen that node 6 sends out its three forks. If we assume that the
corresponding request messages arrive when node 6 already holds all three forks, then upon sending
them out all components of $C^6$ are sent as zeroes, except for $C^6_{1,6}$, sent as 1, 2, or 3, depending on
the destination. For $X = 5$, at the occurrence of v' node 1 updates its matrix clock in such a way



In summary, we have in this paper introduced a novel notion of matrix clocks. Similarly to the
original matrix clocks [21, 24], whose definition appears in (2), our matrix clock has the potential
of reflecting causal dependencies in the flow of messages that stretch as far as depth x into the past.
Unlike those original matrix clocks, however, the size of ours increases with x as $nx$, while in the original case
the growth is according to an $O(n^x)$ function, that is, exponentially.

We have illustrated the applicability of the new matrix clocks with an example from the area of 
resource-sharing problems. What we have demonstrated is a means of collecting information locally 
during the resource-sharing computation so that exceedingly long global waits can be detected, 
possibly indicating the need for overall system re-structuring, as described in Section 2. 